Parsing improvements #874
Conversation
… can now be "covered by" another parser, meaning they will be skipped if the other parser is run. parsersOrder was also cleaned up a tiny bit
… parsing implementation
I ran a small local test with the IntelliJ profiler to see how effective the drop of exceptions is. The test might not reflect real-world differences, but I emphasised the worst-case scenario (aka a DF with many String columns). The test:

```kotlin
val df = dataFrameOf(List(5_000) { "_$it" }).fill(100) {
    Random.nextInt().toChar().toString() + Random.nextInt().toChar()
} // a 100r x 5000c DF filled with strings

df.parse() // all parsers are run
```

These are the results:
(cherry picked from commit 5c567c5)
# Conflicts:
#	core/src/main/kotlin/org/jetbrains/kotlinx/dataframe/impl/api/parse.kt
…as keys like Double appear multiple times)
```kotlin
@@ -55,10 +67,17 @@ internal interface StringParser<T> {

    fun applyOptions(options: ParserOptions?): (String) -> T?

    /** If a parser with one of these types is run, this parser can be skipped. */
    val coveredBy: Collection<KType>
```
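The `coveredBy` mechanism can be illustrated with a small standalone sketch. The `Parser` record and `select` function below are hypothetical stand-ins, not the library's actual `StringParser` interface:

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class CoveredBySketch {
    // Hypothetical stand-in for StringParser: a parser's target type name
    // plus the set of type names that "cover" it
    record Parser(String type, Set<String> coveredBy) {}

    // Drop every parser that is covered by another parser in the selection:
    // if the covering parser fails on a string, the covered one would fail too
    static List<Parser> select(List<Parser> parsersOrder) {
        Set<String> present = parsersOrder.stream()
                .map(Parser::type)
                .collect(Collectors.toSet());
        return parsersOrder.stream()
                .filter(p -> p.coveredBy().stream().noneMatch(present::contains))
                .toList();
    }

    public static void main(String[] args) {
        List<Parser> order = List.of(
                new Parser("kotlinx.datetime.Instant", Set.of()),
                new Parser("java.time.Instant", Set.of("kotlinx.datetime.Instant")));
        // java.time.Instant is skipped because kotlinx.datetime.Instant will run
        select(order).forEach(p -> System.out.println(p.type()));
    }
}
```

Note that the covered parser is only dropped when its covering parser is actually present, which keeps the door open for skipping individual parser types later.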
Comment is clear, but the naming is probably weird
any suggestions? :)
```kotlin
// parse each value/frame column at any depth inside each DataFrame in the frame column
col.isFrameColumn() ->
    col.values.map {
        async {
```
Do we really need a 2-level coroutine here? How many coroutines will be created when parsing a large file, say 1,000,000 rows on 100 columns?
We probably need to set up the correct dispatcher or give this ability to the user:
https://kotlinlang.org/api/kotlinx.coroutines/kotlinx-coroutines-core/kotlinx.coroutines/-coroutine-dispatcher/
If we run with the default dispatcher, it could consume too many resources on the machine.
> Do we really need a 2-level coroutine here? How many coroutines will be created when parsing a large file, say 1,000,000 rows on 100 columns?

Coroutines are cheap :) DataFrames in the cells of frame columns are independent of each other, as are columns in a DF, so it only makes sense to split them off into different coroutine branches.
> Probably we need to set up correct dispatcher or give this ability to the user

I agree, but I'm not sure how to do this neatly from the API. The correct way would be to make `parse()` and all its overloads suspend functions. That way the user can decide which scope to run it in and with which dispatcher. The problem is that the DF API is not built around coroutines, nor should users be forced to put every call in a suspending context, so this would require all overloads to be written twice, both with and without `suspend`... @koperagen any ideas?
```kotlin
// Base case, parse the column if it's a `String?` column
col.isSubtypeOf<String?>() ->
    col.cast<String?>().tryParse(options)
```
Are we still throwing some exceptions here, in `tryParse`?
No, `DataColumn.tryParse` means it tries to parse the column, but if that fails, it keeps it as `String`. This is in contrast with `DataColumn.parse`, which does throw an exception if the column couldn't be parsed.

I'll add some quick KDoc, because it looks confusing indeed.
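The difference between the two behaviors can be sketched outside the DataFrame API. The helper names below are hypothetical, and integers stand in for the real parser types:

```java
import java.util.List;
import java.util.stream.Collectors;

public class TryParseSketch {
    // tryParse-style: attempt conversion, keep the original String on failure
    static List<Object> tryParseInts(List<String> values) {
        return values.stream()
                .map(v -> {
                    try {
                        return (Object) Integer.valueOf(v);
                    } catch (NumberFormatException e) {
                        return (Object) v; // parsing failed: keep it as String
                    }
                })
                .collect(Collectors.toList());
    }

    // parse-style: the exception propagates if any value cannot be converted
    static List<Integer> parseInts(List<String> values) {
        return values.stream()
                .map(Integer::valueOf) // NumberFormatException propagates
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(tryParseInts(List.of("1", "x", "3"))); // [1, x, 3]
        System.out.println(parseInts(List.of("1", "2", "3")));    // [1, 2, 3]
    }
}
```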
I worry about parallel parsing a little bit
Also I had a thought: what if we enable DEBUG logging for failures, like

```kotlin
catchSilent {
    logger.debug("parsing problems")
}
```
Parallel might need a little redesign indeed, especially if we'd want to introduce coroutines in other parts of the library where they might be useful. Personally, I wouldn't add logging to each individual parser unless we can guarantee zero impact on performance when the logging level is not debug. But even then, with debug enabled, we would generate a TON of logs when parsing.
I'd try ParallelStream instead of coroutines
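For reference, a per-column fan-out with `parallelStream()` might look roughly like this. This is a sketch, with a trivial `parseColumn` standing in for the real per-column parsing logic:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ParallelParseSketch {
    // Stand-in for the real per-column parser: here we just upper-case values
    static List<String> parseColumn(List<String> column) {
        return column.stream().map(String::toUpperCase).collect(Collectors.toList());
    }

    // Parse columns in parallel on the common ForkJoinPool, no coroutines needed
    static Map<String, List<String>> parseAll(Map<String, List<String>> columns) {
        return columns.entrySet()
                .parallelStream()
                .collect(Collectors.toConcurrentMap(Map.Entry::getKey,
                        e -> parseColumn(e.getValue())));
    }

    public static void main(String[] args) {
        Map<String, List<String>> df = Map.of(
                "a", List.of("x", "y"),
                "b", List.of("p", "q"));
        System.out.println(parseAll(df).get("a")); // [X, Y]
    }
}
```

Unlike coroutines, this gives no per-call control over the dispatcher: work runs on the common ForkJoinPool unless it is submitted to a custom pool.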
```kotlin
@@ -142,9 +157,340 @@ class ParseTests {
    columnOf("2022-01-23T04:29:40").parse().type shouldBe typeOf<LocalDateTime>()
}

@Test
fun `can parse instants`() {
```
Can we ask the stdlib for a non-throwing parsing function?
ah, my bad, it's from Java, right?
In any case, I want to understand better which copy-pasted parts are needed.
For Kotlin Instants, we've got `parseOrNull`, a non-throwing parsing function :) For Java I managed to make my own `parseOrNull`.

I copied over a lot of tests from kotlinx-datetime so that we can catch functional changes when we update java/kotlinx-datetime versions.

The biggest thing I copied was regarding Kotlin `Duration`. It would be great if we could have a non-throwing `Duration.parseOrNull` in the stdlib. I solved this by creating a function `Duration.canParse` which contains copied logic but returns `false` instead of throwing an exception. For this I also put plenty of tests in place to catch functional changes if we bump the Kotlin version.
ParallelStream does work; however, for big JSON-like DataFrames I'm a bit concerned:
@zaleslaw @koperagen I adjusted the coroutine implementation such that users can supply how they want the parser to run in their coroutine scope, if they desire to do so. An example can be found in the tests.

I made the parse functions inline so as to "leak" the suspend scope inside. The benefit of this notation is that the function can be called both inside and outside suspend functions while still allowing the user to control how it's executed. More explanation here:
Removing parallel behavior for now. We can discuss it in #723

Generated sources will be updated after merging this PR.
fixes #849

Small logic rewrite for `tryParseImpl` and added KDocs. StringParsers can now be "covered by" another parser, meaning they will be skipped if the other parser is run. For instance, it makes no sense to check whether a string can be parsed as a `java.time.Instant` if it cannot be parsed as a `kotlinx.datetime.Instant`.

The reason I didn't remove parsers that are covered by other parsers is to keep open the option to skip some parser types in the future. Say a user wants Java date-time classes instead of Kotlin ones: the Kotlin ones can be skipped and the Java ones will still run. This would need to be implemented separately in the future, but I have plans to implement parser-skipping for CSV readers that already handle the parsing of some types but not all.

More importantly, this PR removes as many exceptions as possible from the "default path" of parsing. To avoid exceptions I did the following:

- `kotlinx.datetime.Instant`: `Instant.parse` calls `DateTimeComponents.Formats.ISO_DATE_TIME_OFFSET.parse().toInstantUsingOffset()` instead of the exception-less `parseOrNull()`, so we'll simply use that version instead. Plus, to catch leap seconds, when it fails to parse, we'll try the Java instants too.
- `java.time.Duration`, `kotlin.time.Duration`: the copied parsing logic now returns null instead of throwing. This might be a bit more difficult to maintain; however, I put some tests in place to check for behavioral changes on the Kotlin side (tests that are "inspired" by the official tests).
- `java.time.Instant`, `java.time.LocalDateTime`, `java.time.LocalDate`, `java.time.LocalTime`: instead of calling `DateTimeFormatter.parse` directly, we can call `parseUnresolved` first. This we can catch failing without it throwing an error. If it does not fail, we call the normal `parse`, which now has a much lower chance of throwing an exception.

Finally, I used coroutines (as suggested in #723) to parallelize the parse operation per column.